- "FTXT" IFF Formatted Text
-
- Date: November 15, 1985
- From: Steve Shaw and Jerry Morrison, Electronic Arts and
- Bob "Kodiak" Burns, Commodore-Amiga
- Status: Draft 2.6
-
- DRAFT DRAFT DRAFT
- DRAFT DRAFT DRAFT
-
- 1. Introduction
-
- This memo is the IFF supplement for FORM FTXT. An FTXT is an IFF "data
- section" or "FORM type" which can be an IFF file or a part of one containing
- a stream of text plus optional formatting information. "EA IFF 85"
- is Electronic Arts' standard for interchange format files. (See the
- IFF reference.)
-
- An FTXT is an archival and interchange representation designed for
- three uses. The simplest use is for a "console device" or "glass teletype"
- (the minimal 2-D text layout means): a stream of "graphic" ("printable")
- characters plus positioning characters "space" ("SP") and line terminator
- ("LF"). This is not intended for cursor movements on a screen although
- it does not conflict with standard cursor-moving characters. The second
- use is text that has explicit formatting information (or "looks")
- such as font family and size, typeface, etc. The third use is as the
- lowest layer of a structured document that also has "inherited" styles
- to implicitly control character looks. For that use, FORMs FTXT would
- be embedded within a future document FORM type. The beauty of FTXT
- is that these three uses are interchangeable, that is, a program written
- for one purpose can read and write the others' files. So a word processor
- does not have to write a separate plain text file to communicate with
- other programs.
-
- Text is stored in one or more "CHRS" chunks inside an FTXT. Each CHRS
- contains a stream of 8-bit text compatible with ISO and ANSI data
- interchange standards. FTXT uses just the central character set from
- the ISO/ANSI standards. (These two standards are henceforth called
- "ISO/ANSI" as in "see the ISO/ANSI reference".)
-
- Since it's possible to extract just the text portions from future
- document FORM types, programs can exchange data without having to
- save both plain text and formatted text representations.
-
- Character looks are stored as embedded control sequences within CHRS
- chunks. This document specifies which class of control sequences to
- use: the CSI group. This document does not yet specify their meanings,
- e.g. which one means "turn on italic face". Consult ISO/ANSI.
-
- Section 2 defines the chunk types character stream "CHRS" and font
- specifier "FONS". These are the "standard" chunks. Specialized chunks
- for private or future needs can be added later. Section 3 outlines
- an FTXT reader program that strips a document down to plain unformatted
- text. Appendix A is a code table for the 8-bit ISO/ANSI character
- set used here. Appendix B is an example FTXT shown as a box diagram.
- Appendix C is a racetrack diagram of the syntax of ISO/ANSI control
- sequences.
-
-
- Reference:
-
- Amiga[tm] is a trademark of Commodore-Amiga, Inc.
-
- Electronic Arts[tm] is a trademark of Electronic Arts.
-
- IFF: "EA IFF 85" Standard for Interchange Format Files describes the
- underlying conventions for all IFF files.
-
- ISO/ANSI: ISO/DIS 6429.2 and ANSI X3.64-1979. International Organization
- for Standardization (ISO) and American National Standards Institute
- (ANSI) data-interchange standards. The relevant parts of these two
- standards documents are identical. ISO standard 2022 is also relevant.
-
-
- 2. Standard Data and Property Chunks
-
- The main contents of a FORM FTXT are in its character stream "CHRS"
- chunks. Formatting property chunks may also appear. The only formatting
- property yet defined is "FONS", a font specifier. A FORM FTXT with
- no CHRS represents an empty text stream. A FORM FTXT may contain nested
- IFF FORMs, LISTs, or CATs, although a "stripping" reader (see section
- 3) will ignore them.
-
- Character Set
-
- FORM FTXT uses the core of the 8-bit character set defined by the
- ISO/ANSI standards cited at the start of this document. (See Appendix
- A for a character code table.) This character set is divided into
- two "graphic" groups plus two "control" groups. Eight of the control
- characters begin ISO/ANSI standard control sequences. (See "Control
- Sequences", below.) Most control sequences and control characters
- are reserved for future use and for compatibility with ISO/ANSI. Current
- reader programs should skip them.
-
- • C0 is the group of control characters in the range NUL (hex
- 0) through hex 1F. Of these, only LF (hex 0A) and ESC (hex 1B) are
- significant. ESC begins a control sequence. LF is the line terminator,
- meaning "go to the first horizontal position of the next line". All
- other C0 characters are not used. In particular, CR (hex 0D) is not
- recognized as a line terminator.
-
- • G0 is the group of graphic characters in the range hex 20 through
- hex 7F. SP (hex 20) is the space character. DEL (hex 7F) is the delete
- character which is not used. The rest are the standard ASCII printable
- characters "!" (hex 21) through "~" (hex 7E).
-
- • C1 is the group of extended control characters in the range
- hex 80 through hex 9F. Some of these begin control sequences. The
- control sequence starting with CSI (hex 9B) is used for FTXT formatting.
- All other control sequences and C1 control characters are unused.
-
- • G1 is the group of extended graphic characters in the range
- NBSP (hex A0) through "ÿ" (hex FF). It is one of the alternate graphic
- groups proposed for ISO/ANSI standardization.
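-
- As an illustration only, the four ranges above can be written as a
- small C classifier. The names CharGroup, ftxt_group, and ftxt_is_text
- are hypothetical and not part of this standard. The sketch assumes a
- plain-text reader that keeps LF, SP through "~", and every G1 character,
- and treats DEL as unused even though it lies in the G0 range.

```c
/* The four character groups of the 8-bit set described above. */
typedef enum { GROUP_C0, GROUP_G0, GROUP_C1, GROUP_G1 } CharGroup;

/* Classify one byte by its numeric range alone. */
CharGroup ftxt_group(unsigned char c)
{
    if (c <= 0x1F) return GROUP_C0;   /* NUL through hex 1F */
    if (c <= 0x7F) return GROUP_G0;   /* SP through DEL */
    if (c <= 0x9F) return GROUP_C1;   /* hex 80 through hex 9F */
    return GROUP_G1;                  /* NBSP (hex A0) through hex FF */
}

/* Nonzero if a plain-text reader would keep the byte as text. */
int ftxt_is_text(unsigned char c)
{
    if (c == 0x0A) return 1;                          /* LF, line terminator */
    if (ftxt_group(c) == GROUP_G0) return c != 0x7F;  /* DEL is not used */
    return ftxt_group(c) == GROUP_G1;
}
```

- With these definitions CR (hex 0D) is classified as C0 and is not kept
- as text, matching the rule that only LF terminates a line.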
-
- Control Sequences
-
- Eight of the control characters begin ISO/ANSI standard "control sequences"
- (or "escape sequences"). These sequences are described below and diagrammed
- in Appendix C.
-
- G0        ::= (SP through DEL)
- G1        ::= (NBSP through "ÿ")
-
- ESC-Seq   ::= ESC (SP through "/")* ("0" through "~")
- ShiftToG2 ::= SS2 G0
- ShiftToG3 ::= SS3 G0
- CSI-Seq   ::= CSI (SP through "?")* ("@" through "~")
- DCS-Seq   ::= (DCS | OSC | PM | APC) (SP through "~" | G1)* ST
-
- "ESC-Seq" is the control sequence ESC (hex 1B), followed by zero or
- more characters in the range SP through "/" (hex 20 through hex 2F),
- followed by a character in the range "0" through "~" (hex 30 through
- hex 7E). These sequences are reserved for future use and should be
- skipped by current FTXT reader programs.
-
- SS2 (hex 8E) and SS3 (hex 8F) shift the single following G0 character
- into yet-to-be-defined graphic sets G2 and G3, respectively. These
- sequences should not be used until the character sets G2 and G3 are
- standardized. A reader may simply skip the SS2 or SS3 (taking the
- following character as a corresponding G0 character) or replace the
- two-character sequence with a character like "?" to mean "absent".
-
- FTXT uses "CSI-Seq" control sequences to store character formatting
- (font selection by number, typeface, and text size) and perhaps layout
- information (position and rotation). "CSI-Seq" control sequences start
- with CSI (the "control sequence introducer", hex 9B). Syntactically,
- the sequence includes zero or more characters in the range SP through
- "?" (hex 20 through hex 3F) and a concluding character in the range
- "@" through "~" (hex 40 through hex 7E). These sequences may be skipped
- by a minimal FTXT reader, i.e. one that ignores formatting information.
-
- Note: A future FTXT standardization document will explain the uses
- of CSI-Seq sequences for setting character face (light weight vs.
- medium vs. bold, italic vs. upright, height, pitch, position, and
- rotation). For now, consult the ISO/ANSI references.
-
- "DCS-Seq" is a control sequence starting with DCS (hex 90), OSC
- (hex 9D), PM (hex 9E), or APC (hex 9F), followed by zero or more characters
- each of which is in the range SP through "~" (hex 20 through hex 7E)
- or else a G1 character, and terminated by an ST (hex 9C). These sequences
- are reserved for future use and should be skipped by current FTXT
- reader programs.
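-
- The grammar above can be sketched as a C routine that measures one
- control sequence so that a reader can step over it. The function name
- ftxt_seq_len is hypothetical. Its handling of malformed or truncated
- input (stopping just before the first byte that does not fit) is an
- assumption modeled on the SkipControl routine in section 3, not a rule
- stated by ISO/ANSI. SS2 and SS3 are not covered here.

```c
#include <stddef.h>

enum { ESC = 0x1B, DCS = 0x90, CSI = 0x9B, ST = 0x9C,
       OSC = 0x9D, PM = 0x9E, APC = 0x9F };

/* Length in bytes of the control sequence starting at p[0], or 0 if
 * p[0] does not begin an ESC-Seq, CSI-Seq, or DCS-Seq. */
size_t ftxt_seq_len(const unsigned char *p, size_t n)
{
    unsigned char lo, hi;     /* range of the repeated middle characters */
    unsigned char tlo, thi;   /* range of the terminating character */
    int g1_ok = 0;            /* nonzero if G1 characters may repeat too */
    size_t i = 1;

    if (n == 0) return 0;
    switch (p[0]) {
    case ESC: lo = 0x20; hi = 0x2F; tlo = 0x30; thi = 0x7E; break;
    case CSI: lo = 0x20; hi = 0x3F; tlo = 0x40; thi = 0x7E; break;
    case DCS: case OSC: case PM: case APC:
        lo = 0x20; hi = 0x7E; tlo = ST; thi = ST; g1_ok = 1; break;
    default:
        return 0;
    }
    /* Zero or more middle characters. */
    while (i < n && ((p[i] >= lo && p[i] <= hi) || (g1_ok && p[i] >= 0xA0)))
        i++;
    /* One terminating character, if present and well formed. */
    if (i < n && p[i] >= tlo && p[i] <= thi)
        i++;
    return i;
}
```

- For example, the three bytes CSI "1" "m" yield a length of 3, while a
- CSI followed directly by LF yields 1, leaving the LF to be read as text.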
-
- Data Chunk CHRS
-
- A CHRS chunk contains a sequence of 8-bit characters abiding by the
- ISO/ANSI standards cited at the start of this document. This includes
- the character set and control sequences as described above and summarized
- in Appendices A and C.
-
- A FORM FTXT may contain any number of CHRS chunks. Taken together,
- they represent a single stream of textual information. That is, the
- contents of CHRS chunks are effectively concatenated except that (1)
- each control sequence must be completely within a single CHRS chunk,
- and (2) any formatting property chunks appearing between two CHRS
- chunks affect the formatting of the latter chunk's text. Any formatting
- settings set by control sequences inside a CHRS carry over to the
- next CHRS in the same FORM FTXT. All formatting properties stop at
- the end of the FORM since IFF specifies that adjacent FORMs are independent
- of each other (although not independent of any properties inherited
- from an enclosing LIST or FORM).
-
- Property Chunk FONS
-
- The optional property "FONS" holds a FontSpecifier as defined in the
- C declaration below. It assigns a font to a numbered "font register"
- so it can be referenced by number within subsequent CHRS chunks. (This
- function is not provided within the ISO and ANSI standards.) The font
- specifier gives both a name and a description for the font so the
- recipient program can do font substitution.
-
- By default, CHRS text uses font 1 until it selects another font. A
- minimal text reader always uses font 1. If font 1 hasn't been specified,
- the reader may use the local system font as font 1.
-
- typedef struct {
-     UBYTE id;
-         /* 0 through 9 is a font id number referenced by an
-          * SGR control sequence selective parameter of 10
-          * through 19. Other values are reserved for future
-          * standardization.
-          */
-     UBYTE pad1;    /* reserved for future use; store 0 here */
-     UBYTE proportional;
-         /* proportional font? 0 = unknown, 1 = no, 2 = yes */
-     UBYTE serif;
-         /* serif font? 0 = unknown, 1 = no, 2 = yes */
-     char name[];
-         /* A NULL-terminated string naming preferred font. */
- } FontSpecifier;
-
- Fields are filed in the order shown. The UBYTE fields are byte-packed
- (2 per 16-bit word). The field pad1 is reserved for future standardization.
- Programs should store 0 there for now.
-
- The field proportional indicates if the desired font is proportional
- width as opposed to fixed width. The field serif indicates if the
- desired font is serif as opposed to sans serif. [Issue: Discuss font
- substitution!]
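-
- As a hedged sketch, a reader might decode the body of a FONS chunk
- as follows. The names ParsedFONS, parse_fons, and fons_sgr_param are
- hypothetical. The mapping of font id n to SGR parameter 10 + n is one
- reading of the comment in the declaration above, not something this
- document states outright.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char UBYTE;

/* Decoded FONS body; name points into the caller's buffer. */
typedef struct {
    UBYTE id, pad1, proportional, serif;
    const char *name;
} ParsedFONS;

/* Decode a FONS body of `size` bytes. Returns 1 on success, or 0 if the
 * body is too short or the name lacks its terminating zero byte. */
int parse_fons(const UBYTE *body, size_t size, ParsedFONS *out)
{
    if (size < 5) return 0;                        /* 4 fields + 1 name byte */
    if (memchr(body + 4, 0, size - 4) == NULL) return 0;
    out->id = body[0];
    out->pad1 = body[1];
    out->proportional = body[2];
    out->serif = body[3];
    out->name = (const char *)(body + 4);
    return 1;
}

/* Assumed SGR parameter (10 through 19) selecting font register `id`,
 * or -1 for ids that are reserved for future standardization. */
int fons_sgr_param(UBYTE id)
{
    return id <= 9 ? 10 + id : -1;
}
```

- Reading the four fields byte by byte avoids any dependence on how a
- particular compiler pads or aligns the structure.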
-
- Future Properties
-
- New optional property chunks may be defined in the future to store
- additional formatting information. They will be used to represent
- formatting not encoded in standard ISO/ANSI control sequences and
- for "inherited" formatting in structured documents. Text orientation
- might be one example.
-
- Positioning Units
-
- Unless otherwise specified, position and size units used in FTXT formatting
- properties and control sequences are in decipoints (720 decipoints/inch).
- This is ANSI/ISO Positioning Unit Mode (PUM) 2. While a metric standard
- might be nice, decipoints allow the existing U.S.A. typographic units
- to be encoded easily, e.g. "12 points" is "120 decipoints".
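-
- The unit arithmetic can be sketched in C as below. The function names
- are hypothetical, the device resolution parameter is an assumption
- (this document does not discuss rendering), and the rounding shown is
- valid only for non-negative values.

```c
#define DECIPOINTS_PER_POINT 10L
#define DECIPOINTS_PER_INCH  720L

/* Convert typographic points to decipoints, e.g. 12 points -> 120. */
long points_to_decipoints(long points)
{
    return points * DECIPOINTS_PER_POINT;
}

/* Convert decipoints to device pixels at `dpi` pixels per inch,
 * adding half the divisor before the integer division. */
long decipoints_to_pixels(long decipoints, long dpi)
{
    return (decipoints * dpi + DECIPOINTS_PER_INCH / 2) / DECIPOINTS_PER_INCH;
}
```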
-
-
- 3. FTXT Stripper
-
- An FTXT reader program can read the text and ignore all formatting
- and structural information in a document FORM that uses FORMs FTXT
- for the leaf nodes. This amounts to stripping a document down to a
- stream of plain text. It would do this by skipping over all chunks
- except FTXT.CHRS (CHRS chunks found inside a FORM FTXT) and within
- the FTXT.CHRS chunks skipping all control characters and control sequences.
- (Appendix C diagrams this text scanner.) It may also read FTXT.FONS
- chunks to find a description for font 1.
-
- Here's a Pascal-ish program for an FTXT stripper. Given a FORM (a
- document of some kind), it scans for all FTXT.CHRS chunks. This would
- likely be applied to the first FORM in an IFF file.
-
- PROCEDURE ReadFORM4CHRS();  {Read an IFF FORM for FTXT.CHRS chunks.}
- BEGIN
-   IF the FORM's subtype = "FTXT"
-     THEN ReadFTXT4CHRS()
-     ELSE WHILE something left to read in the FORM DO BEGIN
-       read the next chunk header;
-       CASE the chunk's ID OF
-         "LIST", "CAT ": ReadCAT4CHRS();
-         "FORM": ReadFORM4CHRS();
-         OTHERWISE skip the chunk's body;
-         END
-       END
- END;
-
- {Read a LIST or CAT for all FTXT.CHRS chunks.}
- PROCEDURE ReadCAT4CHRS();
- BEGIN
-   WHILE something left to read in the LIST or CAT DO BEGIN
-     read the next chunk header;
-     CASE the chunk's ID OF
-       "LIST", "CAT ": ReadCAT4CHRS();
-       "FORM": ReadFORM4CHRS();
-       "PROP": IF we're reading a LIST AND the PROP's subtype = "FTXT"
-                 THEN read the PROP for "FONS" chunks;
-       OTHERWISE error--malformed IFF file;
-       END
-     END
- END;
-
- PROCEDURE ReadFTXT4CHRS();  {Read a FORM FTXT for CHRS chunks.}
- BEGIN
-   WHILE something left to read in the FORM FTXT DO BEGIN
-     read the next chunk header;
-     CASE the chunk's ID OF
-       "CHRS": ReadCHRS();
-       "FONS": BEGIN
-         read the chunk's contents into a FontSpecifier variable;
-         IF the font specifier's id = 1 THEN use this font;
-         END;
-       OTHERWISE skip the chunk's body;
-       END
-     END
- END;
-
- {Read an FTXT.CHRS. Skip all control sequences and unused control
-  chars.}
- PROCEDURE ReadCHRS();
- BEGIN
-   WHILE something left to read in the CHRS chunk DO
-     CASE read the next character OF
-       LF: start a new output line;
-       ESC: SkipControl([' '..'/'], ['0'..'~']);
-       IN [' '..'~'], IN [NBSP..'ÿ']: output the character;
-       SS2, SS3: ;  {Just handle the following G0 character
-                     directly, ignoring the shift to G2 or G3.}
-       CSI: SkipControl([' '..'?'], ['@'..'~']);
-       DCS, OSC, PM, APC: SkipControl([' '..'~'] + [NBSP..'ÿ'], [ST]);
-       OTHERWISE ;  {Discard any other control character.}
-       END
- END;
-
- {Skip a control sequence of the format (rSet)* (tSet), i.e. any number
-  of characters in the set rSet followed by a character in the set tSet.}
- PROCEDURE SkipControl(rSet, tSet);
- VAR c: CHAR;
- BEGIN
-   REPEAT c := read the next character
-     UNTIL c NOT IN rSet;
-   IF c NOT IN tSet
-     THEN put character c back into the input stream;
- END
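-
- For comparison, ReadCHRS and SkipControl might look like this in C.
- This is a sketch under stated assumptions, not a reference
- implementation: the names strip_chrs and skip_control are hypothetical,
- the whole CHRS body is assumed to be in memory, and the caller is
- assumed to supply an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Skip (rSet)* (tSet) starting at in[i]; returns the index just past it.
 * rSet is the range rlo..rhi, plus all G1 characters when g1 is nonzero. */
static size_t skip_control(const unsigned char *in, size_t n, size_t i,
                           unsigned char rlo, unsigned char rhi, int g1,
                           unsigned char tlo, unsigned char thi)
{
    while (i < n && ((in[i] >= rlo && in[i] <= rhi) || (g1 && in[i] >= 0xA0)))
        i++;
    if (i < n && in[i] >= tlo && in[i] <= thi)
        i++;                       /* consume a well-formed terminator */
    return i;
}

/* Copy the plain text of a CHRS body to out; returns the output length. */
size_t strip_chrs(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < n) {
        unsigned char c = in[i++];
        if (c == 0x0A || (c >= 0x20 && c <= 0x7E) || c >= 0xA0)
            out[o++] = c;          /* LF, SP through "~", or G1 */
        else if (c == 0x1B)        /* ESC */
            i = skip_control(in, n, i, 0x20, 0x2F, 0, 0x30, 0x7E);
        else if (c == 0x9B)        /* CSI */
            i = skip_control(in, n, i, 0x20, 0x3F, 0, 0x40, 0x7E);
        else if (c == 0x90 || c == 0x9D || c == 0x9E || c == 0x9F)
            i = skip_control(in, n, i, 0x20, 0x7E, 1, 0x9C, 0x9C);
        /* SS2, SS3, DEL, and all other control characters are dropped. */
    }
    return o;
}
```

- Dropping SS2 and SS3 means the character after the shift is emitted as
- an ordinary G0 character, which is the first of the two options that
- section 2 allows.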
-
- The following program is an optimized version of the above routines
- ReadFORM4CHRS and ReadCAT4CHRS for the case where you're ignoring
- fonts as well as formatting. It takes advantage of certain facts of
- the IFF format to read a document FORM and its nested FORMs, LISTs,
- and CATs without a stack. In other words, it's a hack that ignores
- all fonts and faces to cheaply get to the plain text of the document.
-
- {Cheap scan of an IFF FORM for FTXT.CHRS chunks.}
- PROCEDURE ScanFORM4CHRS();
- BEGIN
-   IF the document FORM's subtype = "FTXT"
-     THEN ReadFTXT4CHRS()
-     ELSE WHILE something left to read in the FORM DO BEGIN
-       read the next chunk header;
-       IF it's a group chunk (LIST, FORM, PROP, or CAT)
-         THEN read its subtype ID;
-       CASE the chunk's ID OF
-         "LIST", "CAT ":;  {NOTE: See explanation below.*}
-         "FORM": IF this FORM's subtype = "FTXT"
-                   THEN ReadFTXT4CHRS()
-                   ELSE;  {NOTE: See explanation below.*}
-         OTHERWISE skip the chunk's body;
-         END
-       END
- END;
-
- *Note: This implementation is subtle. After reading a group header
- other than FORM FTXT it just continues reading. This amounts to reading
- all the chunks inside that group as if they weren't nested in a group.
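-
- The same flat scan can be sketched in C over an IFF file held in
- memory. The function names are hypothetical. The sketch assumes the
- chunk layout given in the IFF reference (a 4-character ID, a 32-bit
- big-endian size, the body, and a pad byte after an odd-sized body),
- and it copies CHRS bodies verbatim without stripping control sequences.

```c
#include <stddef.h>
#include <string.h>

static unsigned long be32(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8) | (unsigned long)p[3];
}

static int is_id(const unsigned char *p, const char *id)
{
    return memcmp(p, id, 4) == 0;
}

/* Copy every CHRS body found in buf[pos..end), the contents of one
 * FORM FTXT, to out. Returns the number of bytes copied. */
static size_t read_ftxt(const unsigned char *buf, size_t pos, size_t end,
                        unsigned char *out)
{
    size_t o = 0;
    while (pos + 8 <= end) {
        size_t size = (size_t)be32(buf + pos + 4);
        if (size > end - pos - 8) break;          /* truncated chunk */
        if (is_id(buf + pos, "CHRS")) {
            memcpy(out + o, buf + pos + 8, size);
            o += size;
        }
        pos += 8 + size + (size & 1);             /* body plus pad byte */
    }
    return o;
}

/* Stackless scan: group headers other than FORM FTXT are stepped over
 * so that their contents are read as if they were not nested. */
size_t scan_for_chrs(const unsigned char *buf, size_t n, unsigned char *out)
{
    size_t pos = 0, o = 0;
    while (pos + 8 <= n) {
        const unsigned char *h = buf + pos;
        size_t size = (size_t)be32(h + 4);
        int group = is_id(h, "FORM") || is_id(h, "LIST") || is_id(h, "CAT ");
        if (size > n - pos - 8) break;            /* truncated chunk */
        if (group && size >= 4) {
            if (is_id(h, "FORM") && is_id(h + 8, "FTXT")) {
                size_t end = pos + 8 + size;
                o += read_ftxt(buf, pos + 12, end, out + o);
                pos = end + (size & 1);
            } else {
                pos += 12;                        /* header and subtype only */
            }
        } else {
            pos += 8 + size + (size & 1);         /* skip PROP and data chunks */
        }
    }
    return o;
}
```

- The output buffer is assumed to be at least as large as the input.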
-
-
- Appendix A. Character Code Table
-
- This table corresponds to the ISO/DIS 6429.2 and ANSI X3.64-1979 8-bit
- character set standards. Only the core character set of those standards
- is used in FTXT.
-
- Two G1 characters aren't defined in the standards and are shown as
- dark gray entries in this table. Light gray shading denotes control
- characters. (DEL is a control character although it belongs to the
- graphic group G0.) The following five rare G1 characters are left
- blank in the table below due to limitations of available fonts: hex
- A8, D0, DE, F0, and FE.
-
-
-
- ISO/DIS 6429.2 and ANSI X3.64-1979 Character Code Table
-
-
- (figure named "TextTable", viewable by ShowILBM or SeeILBM)
-
-
- [_____] [_______________________] [_____] [____________________________]
- Control       Graphic Group       Control        Graphic Group
-  Group             G0              Group              G1
-   C0                                C1
-
- "NBSP" is a "non-breaking space"
- "SHY" is a "soft-hyphen"
-
-
-
- Appendix B. FTXT Example
-
- Here's a box diagram for a simple example: "The quick brown fox jumped.Four
- score and seven", written in a proportional serif font named "Roman".
-
-
- +-----------------------------------+
- |'FORM'         24070               |    FORM 24070 ILBM
- +-----------------------------------+
- |'ILBM'                             |
- +-----------------------------------+
- | +-------------------------------+ |
- | | 'BMHD'      20                | |    .BMHD 20
- | | 320, 200, 0, 0, 3, 0, 0, ...  | |
- | +-------------------------------+ |
- | | 'CMAP'      21                | |    .CMAP 21
- | | 0, 0, 0; 32, 0, 0; 64,0,0; .. | |
- | +-------------------------------+ |
- | +-------------------------------+ |
- | |'BODY'       24000             | |    .BODY 24000
- | |0, 0, 0, ...                   | |
- | +-------------------------------+ |
- +-----------------------------------+
-
- The "0" after the CMAP chunk is a pad byte.
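-
- The diagram above illustrates IFF chunk nesting with an ILBM picture.
- For the text and font described at the start of this appendix, a
- writer might produce the same kind of nesting as sketched below. The
- function names are hypothetical, and the choices of font id 1, a
- single CHRS chunk, and a name limit of 64 bytes are assumptions of
- this sketch only.

```c
#include <stddef.h>
#include <string.h>

static void put_be32(unsigned char *p, unsigned long v)
{
    p[0] = (unsigned char)((v >> 24) & 0xFF);
    p[1] = (unsigned char)((v >> 16) & 0xFF);
    p[2] = (unsigned char)((v >> 8) & 0xFF);
    p[3] = (unsigned char)(v & 0xFF);
}

/* Write one chunk (ID, size, body, and a zero pad byte after an
 * odd-sized body). Returns the number of bytes written. */
static size_t put_chunk(unsigned char *p, const char *id,
                        const void *body, size_t size)
{
    memcpy(p, id, 4);
    put_be32(p + 4, (unsigned long)size);
    memcpy(p + 8, body, size);
    if (size & 1) p[8 + size] = 0;
    return 8 + size + (size & 1);
}

/* Build a FORM FTXT holding one FONS (proportional serif font `font`
 * as font 1) and one CHRS with `text`. Returns the total length, or 0
 * if the font name is too long. `out` must be large enough. */
size_t build_ftxt(unsigned char *out, const char *font, const char *text)
{
    unsigned char fons[4 + 64];
    size_t name_len = strlen(font) + 1;     /* include the zero terminator */
    size_t n = 12;                          /* room for "FORM", size, "FTXT" */

    if (name_len > 64) return 0;
    fons[0] = 1;                            /* id: font 1 */
    fons[1] = 0;                            /* pad1 */
    fons[2] = 2;                            /* proportional: yes */
    fons[3] = 2;                            /* serif: yes */
    memcpy(fons + 4, font, name_len);

    n += put_chunk(out + n, "FONS", fons, 4 + name_len);
    n += put_chunk(out + n, "CHRS", text, strlen(text));

    memcpy(out, "FORM", 4);
    put_be32(out + 4, (unsigned long)(n - 8));
    memcpy(out + 8, "FTXT", 4);
    return n;
}
```

- A call such as build_ftxt(buf, "Roman", "The quick brown fox jumped.")
- would emit a FONS chunk followed by one CHRS chunk, with a pad byte
- after the CHRS body because its length is odd.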
-
-
-
-
- Appendix B. Standards Committee
-
- The following people contributed to the design of this IFF standard:
-
- Bob "Kodiak" Burns, Commodore-Amiga
- R. J. Mical, Commodore-Amiga
- Jerry Morrison, Electronic Arts
- Greg Riker, Electronic Arts
- Steve Shaw, Electronic Arts
- Barry Walsh, Commodore-Amiga
-
-
-
- Appendix C. ISO/ANSI Control Sequences
-
- This is a racetrack diagram of the ISO/ANSI characters and control
- sequences as used in FTXT CHRS chunks.
-
-                        line terminator
- -----+-------------------> LF --------------------------------------->
-      |                 ESC-Seq
-      +-------------------> ESC ---+>----------------+--> 0 thru ~ --->
-      |                            |                 |
-      |                            +-- SP thru / <---+
-      |                 printable
-      +---------------+---> SP thru ~ --+->--------------------------->
-      |               |                 |
-      |               +---> G1 -------->+
-      |                 shift to G2
-      +-------------------> SS2 ----> G0 ---> (produces a G2 character)
-      |                 shift to G3
-      +-------------------> SS3 ----> G0 ---> (produces a G3 character)
-      |                 CSI-Seq
-      +-------------------> CSI ---+>----------------+--> @ thru ~ --->
-      |                            |                 |
-      |                            +-- SP thru ? <---+
-      |                 DCS-Seq
-      +----------> DCS,OSC,PM,or APC --+>-------------+--+-> ST -+---->
-      |                                |              |  |       |
-      |                                +- SP thru ~ <-+  +-> G1 -+
-      |                 discard
-      +----------> any other character ------------------------------->
-
-
-
- Of the various control sequences, only CSI-Seq is used for FTXT character
- formatting information. The others are reserved for future use and
- for compatibility with ISO/ANSI standards. Certain character sequences
- are syntactically malformed, e.g. CSI followed by a C0, C1, or G1
- character. Writer programs should not generate reserved or malformed
- sequences and reader programs should skip them.
-
- Consult the ISO/ANSI standards for the meaning of the CSI-Seq control
- sequences.
-
- The two character set shifts SS2 and SS3 may be used when the graphic
- character groups G2 and G3 become standardized.
-
-